RNA-seq data

The provided dataset comprises gene expression counts obtained from RNA-seq analysis of Gilthead sea bream gonads. The matrix contains information on 2617 genes across 18 samples, each representing a distinct gonadal tissue sample obtained from three biological replicates across 6 stages of development. The average gene count across all samples is 408.3001, indicating the expression level of genes in the dataset. The standard deviation of 1069.982 highlights the variability in gene expression, and the range, spanning from 0 to 44281, underscores the diversity in expression counts across the examined genes. This dataset serves as a valuable resource for investigating the molecular dynamics associated with Gilthead sea bream gonadal development, offering insights into the gene expression patterns that characterize different developmental stages.






Data mining

Data normalization and log-correction

First we visualize the gene counts after simple normalization, the point being to ensure that all the samples have comparable gene counts.






Average gene expression

In our investigation, we explore the impact of developmental stages on average gene expression. Our findings reveal a distinctive pattern: a subset of genes sampled in the gonads exhibits high expression levels during the initial two stages but experiences down-regulation as the gonads continue to develop. Conversely, the remaining portion of the gene population demonstrates relatively low expression in stages 1 and 2, progressively up-regulating as the gonads mature. We observe a pivotal intermediate stage (stage 3) when this transition occurs (on average).






Correlation between samples

We also investigate how the different samples behave in comparison to each other to identify outliers or aberrant samples. To do so, we build the correlation plot between every sample pair. Sudden drops in correlation coefficients might indicate technical issues or sample outliers, while clusters of samples with high correlations may suggest common biological characteristics or batch effects.






Initial Quality control

Regular PCA

2 principal components

The PCA performed on normalized gene counts show that the different samples mostly cluster per developmental stages. Stages 1, 2, and 3: They present very different clusters which suggest a strong change in gene expression during these first stages. Stages 4, 5, and 6: They cluster closer together which suggests there are less changes happening during these last stages of gonads developments.



3 principal components

We extend the plot to the third PC axis of the PCA. One sample in the first developmental stage is slightly away from the rest of its cohort. It could be an asynchronous sample which already experiences some changes in gonad development. A similar pattern can be observed in the 5th development stage. Since we have only 3 biological duplicates per stage, we cannot conclude to any significant outlier.



Generalised PCA

Given that the data are heavily skewed into a heavily-tail distribution, a regular PCA maybe present slightly distorted projections and make it look as though some point strongly diverge from their cluster as an artifact of skewed gene counts. Hence we also perform a Generalized PCA using a Negative Binomial distribution with feature-specific over dispersion to gain further insights into our data.

Samples

We retrieve the behavior displayed in the correlation heatmap where stages 1 and 2 cluster separately whereas stage 3 clusters closer to stages 4, 5, and 6. Furthermore, we find that samples 1_3, 3_3, and 5_2 stand aside from the other samples in there respective developmental stages. Given the sample size, we cannot determine whether these samples are real outliers, hence we do not remove these points but head towards a weighted differential gene expression analysis.






Genes

We also perform the GPCA on the genes to investigate whether they cluster into different groups, which can sometimes hint at patterns into the coming DEG analysis.






Outlier detection

Rare genes

We further check the gene list for rare genes which expression could bias the overall differential expression analysis: most rare genes were already filtered out of the data set and no further filtering is necessary.






Sample quality

After filtering out potentially rare genes, we standardize gene expression and re-run a VOOM analysis to determine sample quality: sample 5_2 should be given less weight in the differential gene expression analysis than other samples. As highlighted in the QC before, sample 5_2 displayed a slightly different behavior compared to other replicates in the 5th developmental stage. It could be that this individual have either more advanced or delayed gonad development compared to the animals sampled along them. The VOOM analysis also detects samples from developmental stages 1 and 3 as potentially weak samples. This is expected as these two developmental stages show about 2/3 of their genes being little expressed in the heat map presented before. Compared to other developmental stages where more genes become highly expressed, these samples seem to be of poor quality but it is a by-product of the time dependency.

Note that if we remove the time-dependence (by transforming the stage variable into a factor), the first two stages no longer are detected as weak samples. Instead, samples 1_3, 3_3, and 5_2 (which were already identified as deviant in the GPCA) stand out as poor quality samples.






Differential Gene Expression Analysis

(With repeated measures design)

LIMMA-VOOM analysis

The LIMMA approach relies on linear Bayesian models to approximate the effect of the development stage on sea bream gene expressions. It also applies quantile normalization to account for differences in gene expression across samples and controls for stage-dependent variance in gene expression.

The average correlation between genes from samples originating from the same biological replicates is -0.1641412. We run a differential gene expression analysis per block while accounting for the correlation between genes across developmental stages.

##        (Intercept) stage
## Down            36   170
## NotSig         179  1900
## Up            2377   522






DESeq2 analysis

## 
## out of 2617 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 940, 36%
## LFC < 0 (down)     : 944, 36%
## outliers [1]       : 30, 1.1%
## low counts [2]     : 0, 0%
## (mean count < 86)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results






Agreement between analyses

Whether we use the LIMMA-VOOM or the DESeq method will affect the panel of genes detected as differentially expressed across developmental stage. Both methods fully agree on the set of genes that are down-regulated as gonads develop but disagreements occur when analyzing which genes are up-regulated and which ones are non-significant.

For instance, 14 genes were detected to be upregulated by the DESeq analysis but were non-significant in the LIMMA-VOOM. And 22 genes were found to be upregulated by the LIMMA-VOOM analysis but were non-significant in the DESeq analysis.

Feel free to go through the list of genes, see if there are problematic results between the two models, or if one analysis is more interesting for your project than the other.






Gene annotation

Genes were annotated using the NCBI database (“ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/”), version “fSpaAur1.1”.

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Sparus aurata
## | SPECIES: Sparus aurata
## | CENTRALID: GID
## | Taxonomy ID: 8175
## | Db type: OrgDb
## | Supporting package: AnnotationDbi






Gene Co-expression Network

Building the co-expression network

## TOM calculation: adjacency..
## ..will not use multithreading.
##  Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
##  ..cutHeight not given, setting it to 0.919  ===>  99% of the (truncated) height range in dendro.
##  ..done.
##  mergeCloseModules: Merging modules whose distance is less than 0
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 8 module eigengenes in given set.
##    Calculating new MEs...
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 8 module eigengenes in given set.



Co-expressed genes

Genes are grouped based on how similarly they are expressed across developmental stages. The grouping relies on a simple hierarchical clustering algorithm and the optimal partition was determined using the dynamic modularity algorithm (PART). Among all the genes which expression levels changed as gonads developed, 8 trends can be identified (i.e., modules of co-expression). We provide a heat map and a graph-like representation of the co-expression network. In the graph, we represent the minimal spanning tree of the co-expression modules separately.

Network heatmap



Network graph



Characteristics of modules of co-expression

Modules vs. samples

For each sample in the study, we determine if one or several blocks of co-expressed genes are particularly represented. All samples collected during stages 1 and 2 show a strong expression of genes belonging to the modules “blue” and “pink”. Samples collected during the 3rd developmental stage display a lower expression of these two modules and a slightly higher expression of all other modules of co-expressed genes. Finally, samples collected later during the development of the gonads show a low expression of genes belonging to the “pink” and “blue” modules and a strong expression of all other modules. The 3rd developmental stage appears to be a transition between stages 1 and 2, on the one hand, and stages 4, 5, and 6 on the other hand. Stages 4,5, and 6, show similar gene expression patterns, suggesting a stabilization of gene expression during these stages.

As detected before, sample 5_2 shows a divergent trend in gene expression.






Modules vs. Samples characteristics

We investigate how the temporal trend in gene expression in each module of co-expressed genes correlates with sample characteristics.
No effects of the biological replicates
Down-regulation of genes in modules “blue” and “pink”
Up-regulation of genes in all other modules






Modules vs. Biological functions

Genes are further clustered by biological functions and the results of this grouping are represented along patterns of up/down-regulation and co-expression:
Down-regulated:
Blue module:
– component c3b tandem duplicate 2,
– ETS domain-containing TF ERF-like,
– peptidyl-prolyl cis-trans isomerase FKBP9-like,
Pink module:
– NADH: ubiquinone oxidoreductase subunit B9,
– component c3b tandem duplicate 2






Enriched Gene Ontology

The GO enrichment analysis is performed using the g:Profiler tool (v.1.2., “http://biit.cs.ut.ee/gprofiler/”) which bases its Gene Ontology on the Ensembl Database. Graphs are simplified to allow easy inspection but may prevent relevant GO terms to display.

We can refine the graphs together if you need.

## $biomart
## [1] "Ensembl"
## 
## $biomart_version
## [1] "110"
## 
## $display_name
## [1] "Gilthead seabream"
## 
## $genebuild
## [1] "fSpaAur1.1"
## 
## $gprofiler_version
## [1] "e110_eg57_p18_4b54a898"
## 
## $organism
## [1] "saurata"
## 
## $sources
## $sources$`GO:BP`
## $sources$`GO:BP`$name
## [1] "biological process"
## 
## $sources$`GO:BP`$version
## [1] "annotations: BioMart\nclasses: releases/2023-07-27"
## 
## 
## $sources$`GO:CC`
## $sources$`GO:CC`$name
## [1] "cellular component"
## 
## $sources$`GO:CC`$version
## [1] "annotations: BioMart\nclasses: releases/2023-07-27"
## 
## 
## $sources$`GO:MF`
## $sources$`GO:MF`$name
## [1] "molecular function"
## 
## $sources$`GO:MF`$version
## [1] "annotations: BioMart\nclasses: releases/2023-07-27"
## 
## 
## $sources$HP
## $sources$HP$name
## [1] "Human Phenotype Ontology"
## 
## $sources$HP$version
## [1] "annotations: 09.2023\nclasses: None"
## 
## 
## 
## $taxonomy_id
## [1] "8175"

Up-regulated genes

Manhattan plot

The graph is interactive. Hover your mouse to see the details of each GO term.



Table



Graph

The graph is interactive. Zoom in and out, move the dots around.

Pink nodes represent enriched GO terms as detected by the g:Profiler tool. Each GO term node carries its name as defined in the Ensembl Database. Black nodes are the differentially expressed genes detected in the DESeq analysis which relate to the GO term they are connected to.



Down-regulated genes

Manhattan plot

The graph is interactive. Hover your mouse to see the details of each GO term.






Table






Graph

The graph is interactive. Zoom in and out, move the dots around.

Pink nodes represent enriched GO terms as detected by the g:Profiler tool. Each GO term node carries its name as defined in the Ensembl Database. Black nodes are the differentially expressed genes detected in the DESeq analysis which relate to the GO term they are connected to.






Enriched KEGG pathway

Coming soon!

Plot